Skip to content

[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s#21092

Closed
ifilonenko wants to merge 25 commits into
apache:masterfrom
ifilonenko:master
Closed

[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s#21092
ifilonenko wants to merge 25 commits into
apache:masterfrom
ifilonenko:master

Conversation

@ifilonenko

@ifilonenko ifilonenko commented Apr 18, 2018

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Introducing Python Bindings for PySpark.

  • Running PySpark Jobs
  • Increased Default Memory Overhead value
  • Dependency Management for virtualenv/conda

How was this patch tested?

This patch was tested with

KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run SparkPi with a test secret mounted into the driver and executor pods
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
Run completed in 4 minutes, 28 seconds.
Total number of tests run: 11
Suites: completed 2, aborted 0
Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@SparkQA

SparkQA commented Apr 18, 2018

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2365/

@SparkQA

SparkQA commented Apr 18, 2018

Copy link
Copy Markdown

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2365/

@erikerlandson

Copy link
Copy Markdown
Contributor

Thanks @ifilonenko !
I'm interested in figuring out what it means for the container images to be "python 2/3 generic" - does that imply being able to run either, based on submit parameters?

@foxish

foxish commented Apr 18, 2018

Copy link
Copy Markdown
Contributor

cc @holdenk

"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--deploy-mode client
"$@" $PYSPARK_PRIMARY $PYSPARK_SECONDARY

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have more descriptive names for PYSPARK_PRIMARY and PYSPARK_SECONDARY? Maybe PYSPARK_MAINAPP and PYSPARK_ARGS?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

rm -r /usr/lib/python*/ensurepip && \
pip install --upgrade pip setuptools && \
rm -r /root/.cache
ENV PYTHON_VERSION 2.7.13

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set this, are we implicitly imposing a contract on the base image to have this particular version of python installed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I brought up in the PR description. And why this still a WIP. I need to investigate the proper way to determine whether we ship these containers with Python2 or Python3.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in some OSes, python vs python3 symlink to the installed version of python, respectively for the version 2.x and 3.x, is that a better approach then hardcoding the version number?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think it might make sense to build the container with both 2 & 3 since the container might be built by a vendor or cluster administrator and then used by a variety of people. What do folks think?

As for figuring out the env, if we wanted to do it that way we can call the current users python and ask it for its version version information (based on the Spark Python enviroment variables).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a canonical container should include both. My instinct is that a user should be able to "force" the use of one or the other. If someone is invoking spark-submit in cluster-mode, with a supplied python file, some kind of CLI argument (--conf or otherwise) seems like the only totally foolproof way to identify that for the eventual pod construction, but maybe there is a better way?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps re-use PYSPARK_PYTHON?

}

test("Apply Python step if main resource is python.") {
val conf = KubernetesConf(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but @mccheah, should we have something like the fluent/builder pattern here for KubernetesConf since it's grown to quite a few params. I'm happy to take a stab at it if we agree that's a good direction.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on what we want to use the builder for. One advantage of a builder is handled by case classes already: the fact that you don't have to order arguments in a particular way; you can get around this by using named parameters when you construct the object. But, if you want to stage the construction of the object in multiple calls, then a builder will get you that while a case class by itself will not.

I think it would be neater to have a builder. The SparkSession Builder is an example from the project we can follow.

}

val driverContainer = new ContainerBuilder(pod.container)
val withoutArgsDriverContainer: ContainerBuilder = new ContainerBuilder(pod.container)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous name seemed clearer to me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a corresponding driver container with args?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. look below

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we do set arguments on this one right? If not please insert a white space so I can see the different visually.

} else baseFeatures
val maybeRoleSecretNamesStep = if (kubernetesConf.roleSecretNamesToMountPaths.nonEmpty) {
Some(provideSecretsStep(kubernetesConf)) } else None
val allFeatures: Seq[KubernetesFeatureConfigStep] =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not need any changes/arg passing during executor pod construction?

@ifilonenko ifilonenko Apr 18, 2018

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but there will be more features and I thought that doing options in the setting of allFeatures was cleaner

.build()

val driverContainer =
if (driverDockerContainer == "driver-py") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we can discover if it's a Python application in a better way here. Probably using the built up spark conf?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can check the appResource but that was already done. I thought it would be overkill to check twice since it was already handled in setting driverDockerContainer

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general I'd prefer having two separate step types here. They can share some logic in either a utils class or a shared superclass. But you only apply one step type for Java apps vs one step type for Python apps.

Another way is to have the basic driver step only do work that would be strictly agnostic of python vs java, and then have a separate step for either Java or Python; the orchestrator picks which one to invoke based on the app resource type. To do this I think the step's constructor needs to take more than just the KubernetesConf as an argument - it needs to take the appropriate specifically-typed MainAppResource as an argument in the constructor as well. This breaks the convention that we've set so far but for now that's probably ok, as long as we don't get parameter length blowup as we go forward.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second way is the approach that I envisioned and tried to implement. It seems that the approach (without putting too much work on the KubernetesConf) breaks the contract we defined tho.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what about applications which need Python support (e.g. have Python UDFS) but don't use a Python driver process?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what about applications which need Python support (e.g. have Python UDFS) but don't use a Python driver process?

Think that's up to the user to make it work - I don't see this being specifically handled by the other cluster managers.

The goal of this PR should be to bring Kubernetes up to par with the other cluster managers with respect to what they provide.Do the other cluster managers provide any specific support for this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently are only running the Python and future R step when we are leveraging a Python (or R) driver process. Else the user would just specify the spark-py docker-image no? and then just continue to run a non-Python driver process.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was forgot that folks could specify the driver container separately from the worker container nvm.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ifilonenko I think this still needs some work to clean up.

What I expect to happen is to have three step types:

  1. BasicDriverFeatureStep, which is what's here except we don't provide the args to the container in this step anymore.
  2. PythonDriverFeatureStep which does both what the PythonDriverFeatureStep does currently plus adds the driver-py argument
  3. JavaDriverFeatureStep which only adds the argument SparkLauncher.NO_RESOURCE, conf.roleSpecificConf.appArgs, etc.

Then in the KubernetesDriverBuilder, always apply the first step, and select which of 2 or 3 to apply based on the app resource type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I didn't know if we wanted to include a JavaDriverFeatureStep. I will do so then.

@foxish

foxish commented Apr 18, 2018

Copy link
Copy Markdown
Contributor

Thanks for taking this on @ifilonenko. Left some initial comments on the PR without going too much in depth - since as you noted, it's WIP.

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got some initial feedback and questions but I'm really excited to see the progress. One thing which I'm a little worried about and wasn't aware of is that the integration tests appear to be living in a seperate non-ASF repo? Whats the story behind that and can we do anything to bring those in?

Comment thread bin/docker-image-tool.sh
Options:
-f file Dockerfile to build. By default builds the Dockerfile shipped with Spark.
-f file Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
-p file Dockerfile with Python baked in. By default builds the Dockerfile shipped with Spark.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One (future concern) is how we would to handle the overlay with both Python and R at the same time.

childMainClass = KUBERNETES_CLUSTER_SUBMIT_CLASS
if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
childArgs ++= Array("--primary-java-resource", args.primaryResource)
if (args.isPython) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic appears to duplicated from YARN, would it make sense to factor this out into a common function?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted about this off-line and while its close its not exactly the same so we can deal with minor parts of duplication for now.

val MEMORY_OVERHEAD_FACTOR =
ConfigBuilder("spark.kubernetes.memoryOverheadFactor")
.doc("This sets the Memory Overhead Factor that will allocate memory to non-JVM jobs " +
"which in the case of JVM tasks will default to 0.10 and 0.40 for non-JVM jobs")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this thanks for adding this.

sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_MAIN_APP_RESOURCE, res)
sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_APP_ARGS, appArgs.mkString(" "))
}
sparkConfWithMainAppJar.set(MEMORY_OVERHEAD_FACTOR, 0.4)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So wait, if the user has specified a different value I don't think we should override it and its not clear to me that this code will not override a user specified value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very true, will need to ensure that it does not override the set value

.build()

val driverContainer =
if (driverDockerContainer == "driver-py") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what about applications which need Python support (e.g. have Python UDFS) but don't use a Python driver process?

assert(kubernetesConfWithoutMainJar.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === 0.1)
}

test("Creating driver conf with a python primary file") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like also see a unit test for with a PyFile and an overriden memory overhead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaults are checked on 96 and 117. (But I need to ensure that it is possible to override as well. Will add)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a follow up we should have a test for with Python and overriding MEMORY_OVERHEAD_FACTOR (e.g. test to make sure that setIfMissing since we had it the other way earlier in the PR).

rm -r /usr/lib/python*/ensurepip && \
pip install --upgrade pip setuptools && \
rm -r /root/.cache
ENV PYTHON_VERSION 2.7.13

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think it might make sense to build the container with both 2 & 3 since the container might be built by a vendor or cluster administrator and then used by a variety of people. What do folks think?

As for figuring out the env, if we wanted to do it that way we can call the current users python and ask it for its version version information (based on the Spark Python enviroment variables).

COPY python /opt/spark/python
RUN apk add --no-cache python && \
python -m ensurepip && \
rm -r /usr/lib/python*/ensurepip && \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment about why this part?

python -m ensurepip && \
rm -r /usr/lib/python*/ensurepip && \
pip install --upgrade pip setuptools && \
rm -r /root/.cache

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just being done for space reasons?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

ENV PYTHON_VERSION 2.7.13
ENV PYSPARK_PYTHON python
ENV PYSPARK_DRIVER_PYTHON python
ENV PYTHONPATH ${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.6-src.zip:${PYTHONPATH}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to need to mention the Py4J zip file needs to be updated here as well :(
Also open question if we want the PySpark.zip file in here instead of the python/, and or if we're trying to make "slim" images if we want to delete that zip file.

@holdenk

holdenk commented Apr 20, 2018

Copy link
Copy Markdown
Contributor

Other not directly related to the code feedback is in the example I would expect sort to be passed as an argument to pi from the quick reading of it and also from using just regular spark submit in local mode so I wouldn't expect spark-submit to not treat it as an argument. Are you just looking to add sort.py as to the users python path so it's included as a resource? If so I think updating the env variables or using --py-files is the way to go. If I've missunderstood that question/example though no stress :)

And thank you so much for working on this I'm super excited to see the progress. Sorry for only the quick first-pass review but I figured since its a work in progress that is what you are looking for. If you want more detailed feedback please ping me :)

@mccheah

mccheah commented Apr 20, 2018

Copy link
Copy Markdown
Contributor

Integration tests are meant to be in this repository but we haven't gotten there yet. See #20697

"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--deploy-mode client
"$@" $PYSPARK_PRIMARY $PYSPARK_SECONDARY

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk I thought the PythonRunner takes in a comma delineated string of PyFiles. as an argument which is why I set it to be --class PythonRunner $PYSPARK_PRIMARY $PYSPARK_FILES $PYSPARK_DRIVER_ARGS

@ifilonenko ifilonenko changed the title [SPARK-23984][K8S][WIP] Initial Python Bindings for PySpark on K8s [SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s May 2, 2018
@SparkQA

SparkQA commented May 2, 2018

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2729/

@SparkQA

SparkQA commented May 2, 2018

Copy link
Copy Markdown

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2729/

@erikerlandson

Copy link
Copy Markdown
Contributor

@holdenk I think your comment above gets at a use-case "ambiguity" that containerization causes. There are now at least two choices of channel for supplying dependencies: from the command line, or by customized container (and here there are at least two sub-cases: manually created customizations, or via source-to-image tooling).

When specifying deps via the command line, particularly in cluster mode, we have backed out of staging local files via init-container; does pulling from URI suffice?

@ifilonenko

ifilonenko commented May 2, 2018

Copy link
Copy Markdown
Contributor Author

@shaneknapp @ssuchter integration tests seem to be failing not due to this PR, but in general. Please investigate, because this PR does pass integration tests + an extra PySpark test.

Error:

Error starting host: Temporary Error: Error configuring auth on host: Temporary Error: ssh command error:
command : sudo systemctl -f restart docker

@ifilonenko

Copy link
Copy Markdown
Contributor Author

retest this please

@SparkQA

SparkQA commented May 2, 2018

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2753/

@SparkQA

SparkQA commented May 2, 2018

Copy link
Copy Markdown

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2753/

@holdenk

holdenk commented May 4, 2018

Copy link
Copy Markdown
Contributor

@erikerlandson I think pulling from URI is fine for now. The actual comment was just focused on the usage of spark-submit in that case, but I agree longer term we should think about dependencies, especially things which can't just be shipped as zip or pyfiles (but I think that is vNext).

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Really excited to see move from WIP to closer and hope we can get this in soon. I have some more questions and feedback, let me know if you need anything else from me.

childMainClass = KUBERNETES_CLUSTER_SUBMIT_CLASS
if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
childArgs ++= Array("--primary-java-resource", args.primaryResource)
if (args.isPython) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We chatted about this off-line and while its close its not exactly the same so we can deal with minor parts of duplication for now.

case PythonMainAppResource(res) =>
additionalFiles += res
maybePyFiles.foreach{maybePyFiles =>
additionalFiles.appendAll(maybePyFiles.split(","))}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR or JIRA, but for later maybe we should normalize our parsing of input files in a way which allows escape characters and share the logic between Yarn/K8s/Mesos/standalone. What do y'all think? Possible follow up JIRA: https://issues.apache.org/jira/browse/SPARK-24184

sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_MAIN_APP_RESOURCE, res)
sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_APP_ARGS, appArgs.mkString(" "))
}
sparkConfWithMainAppJar.setIfMissing(MEMORY_OVERHEAD_FACTOR, 0.4)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to set this in the JVM case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is set later in BaseDriverStep

}

val driverContainer = new ContainerBuilder(pod.container)
val withoutArgsDriverContainer: ContainerBuilder = new ContainerBuilder(pod.container)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we do set arguments on this one right? If not please insert a white space so I can see the different visually.

.build()

val driverContainer =
if (driverDockerContainer == "driver-py") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was forgot that folks could specify the driver container separately from the worker container nvm.

require(mainResource.isDefined, "PySpark Main Resource must be defined")
val otherPyFiles = kubernetesConf.pyFiles().map(pyFile =>
KubernetesUtils.resolveFileUrisAndPath(pyFile.split(","))
.mkString(":")).getOrElse("")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave a comment that we are switching from "," to ":" to match the format expected by the PYTHONPATH environment variable. ( http://xkcd.com/1987 )

.endEnv()
.addNewEnv()
.withName(ENV_PYSPARK_FILES)
.withValue(if (otherPyFiles == "") {""} else otherPyFiles)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, what is this logic?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add empty env vars - see above.

MAIN_CLASS,
APP_ARGS)
APP_ARGS,
None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still want names.

@SparkQA

SparkQA commented Jun 1, 2018

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3623/

@SparkQA

SparkQA commented Jun 1, 2018

Copy link
Copy Markdown

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3623/

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, excited to see this moving forward. Really looking forward to seeing improved integration tests in apache-spark-on-k8s/spark-integration#46.

additionalFiles.appendAll(maybePyFiles.split(","))}
sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_MAIN_APP_RESOURCE, res)
}
sparkConfWithMainAppJar.set(MEMORY_OVERHEAD_FACTOR, 0.4)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, you can see my statement about not overriding the explicitly user provided value in comment on the 20th ("if the user has specified a different value don't think we should override it").

So this logic, as it stands, is K8s specific and I don't think we we can change how YARN chooses its memory overhead in a minor release, so I'd expect this to remain K8s specific until at least 3.0 when we can evaluate if we want to change this in YARN as well.

The memory overhead configuration notice done in the YARN page right now
(see spark.yarn.am.memoryOverhead on http://spark.apache.org/docs/latest/running-on-yarn.html ). So I would document this in http://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties e.g. ./docs/running-on-kubernetes.md).

As for intuitive I'd argue that this actually is more intuitive than what we do in YARN, we know that users who run R & Python need more non-JVM heap space and many users don't know to think about this until their job fails. We can take advantage of our knowledge to handle this setting for the user more often. You can see how often this confuses folks on the list, docs, and stack overflow by looking at "memory overhead exceeded" and "Container killed by YARN for exceeding memory limits" and similar.

@SparkQA

SparkQA commented Jun 1, 2018

Copy link
Copy Markdown

Test build #91394 has finished for PR 21092 at commit 24a704e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.stringConf
.checkValue(pv => List("2", "3").contains(pv),
"Ensure that Python Version is either Python2 or Python3")
.createWithDefault("2")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I reading this right that the default is Python 2? Is there a reason for that? Thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason. I just thought that the major version should default to 2.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only ~18 months of support left for Python 2. Python 3 has been around for 10 years and unless there’s a good reason, I think it should be the default.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am willing to do that: thoughts @holdenk ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with either as the default. While Py2 is officially EOL I think we'll still see PySpark Py2 apps for awhile after.

@SparkQA

SparkQA commented Jun 7, 2018

Copy link
Copy Markdown

Test build #91530 has finished for PR 21092 at commit 6a6d69d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2018

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3692/

@SparkQA

SparkQA commented Jun 7, 2018

Copy link
Copy Markdown

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3692/

@SparkQA

SparkQA commented Jun 7, 2018

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3694/

@SparkQA

SparkQA commented Jun 7, 2018

Copy link
Copy Markdown

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3694/

@ifilonenko

ifilonenko commented Jun 7, 2018

Copy link
Copy Markdown
Contributor Author
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run SparkPi with a test secret mounted into the driver and executor pods
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
Run completed in 4 minutes, 28 seconds.
Total number of tests run: 11
Suites: completed 2, aborted 0
Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:24 min
[INFO] Finished at: 2018-06-07T18:54:42-04:00
[INFO] Final Memory: 21M/509M
[INFO] ------------------------------------------------------------------------

For new addition to: apache-spark-on-k8s/spark-integration#46

@SparkQA

SparkQA commented Jun 8, 2018

Copy link
Copy Markdown

Test build #91537 has finished for PR 21092 at commit ab92913.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super close, thank you for the integration tests really great work. Just a small improvement in the docs and one small unit test is what I see left. We should make sure other folks have a chance for any last comments but hopefully we can merge this next week unless something surprising comes up :)

assert(kubernetesConfWithoutMainJar.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === 0.1)
}

test("Creating driver conf with a python primary file") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a follow up we should have a test for with Python and overriding MEMORY_OVERHEAD_FACTOR (e.g. test to make sure that setIfMissing since we had it the other way earlier in the PR).

Comment thread docs/running-on-kubernetes.md Outdated
<td><code>spark.kubernetes.memoryOverheadFactor</code></td>
<td><code>0.1</code></td>
<td>
This sets the Memory Overhead Factor that will allocate memory to non-JVM jobs which in the case of JVM tasks will default to 0.10 and 0.40 for non-JVM jobs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can maybe improve this documentation a little bit. It's not so much how much memory is set aside for non-JVM jobs, it's how much memory is set aside for non-JVM memory, including off-heap allocations, non-JVM jobs (like Python or R), and system processes.

Some(inputPyFiles.mkString(",")))
assert(kubernetesConfWithMainResource.sparkConf.get("spark.jars").split(",")
=== Array("local:///opt/spark/jar1.jar"))
assert(kubernetesConfWithMainResource.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === 0.4)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as we discussed earlier testing this value explicitly configured with Python would be good to have as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending Jenkins and sign-off from someone with K8s background.

@SparkQA

SparkQA commented Jun 8, 2018

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3712/

@SparkQA

SparkQA commented Jun 8, 2018

Copy link
Copy Markdown

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3712/

@mccheah mccheah left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, will merge to master.

@asfgit asfgit closed this in 1a644af Jun 8, 2018
@felixcheung

Copy link
Copy Markdown
Member

awesome!

@SparkQA

SparkQA commented Jun 8, 2018

Copy link
Copy Markdown

Test build #91573 has finished for PR 21092 at commit a61d897.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lucashu1

lucashu1 commented Jun 14, 2018

Copy link
Copy Markdown

Sorry in advance if this is the wrong place to be asking this!

Does this PR mean that we'll be able to create SparkContexts using PySpark's SparkSession.Builder with master set to k8s://<...>:<...>, and have the resulting jobs run on spark-on-k8s, instead of on local/standalone?

E.g.:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('k8s://https://kubernetes:443').getOrCreate()

I'm trying to use PySpark in a Jupyter notebook that's running inside a Kubernetes pod, and have it use spark-on-k8s instead of resorting to using local[*] as master.

Till now, I've been getting an error saying that:

Error: Python applications are currently not supported for Kubernetes.

whenever I try to use k8s://<...> as master.

Thanks!


UPDATE: Stack Overflow question here in case anyone has an answer!

@felixcheung

Copy link
Copy Markdown
Member

@lucashu1 please send your question to stackoverflow or user@spark.apache.org!

.createWithDefault(0.1)

val PYSPARK_MAJOR_PYTHON_VERSION =
ConfigBuilder("spark.kubernetes.pyspark.pythonversion")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for leaving a comment in an ancient PR but I couldn't hold it. Why did we add a configuration to control Python version instead of using the existent PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON?

Doing this in a configuration breaks or disables many things, for example, PEX (https://medium.com/criteo-labs/packaging-code-with-pex-a-pyspark-example-9057f9f144f3) that requires to set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON manually.

@HyukjinKwon HyukjinKwon Dec 10, 2020

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @dongjoon-hyun too FYI. Conda / virtualenv support enabled by #30486 wouldn't work in Kubernates because of this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon sounds reasonable to include support for that, we just need to agree on a policy for which takes precedence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants